[Quantization] Support Quark W8A8 INT8 MoE inference #36320
tjtanaa merged 5 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds support for Quark W8A8 INT8 MoE inference, which is a valuable addition. The changes are well-structured and address the missing functionality for this quantization scheme. My review focuses on improving code clarity and maintainability. I've pointed out a few instances of misleading documentation and variable names, as well as one case of unreachable code. Addressing these points will enhance the long-term quality of the codebase.
Note: Security Review did not run due to the size of the PR.
Force-pushed from 63483a5 to f53e7e3
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from f53e7e3 to 1125a4e
Force-pushed from 1125a4e to 526ba9c
Force-pushed from 526ba9c to c619159
All review comments addressed, rebased onto latest main, and added an integration test with a tiny MoE model. GSM8K 8-shot accuracy re-verified. @BowenBao
Signed-off-by: kangletian <Letian.Kang@amd.com>
Force-pushed from c619159 to 2e825b2
BowenBao
left a comment
LGTM, please fix pre-commit issues.
Hi @JoursBleu, the pre-commit checks have failed. Please run:

    uv pip install "pre-commit>=4.5.1"
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
|
tjtanaa
left a comment
LGTM, but pre-commit has to be fixed before merging.
Signed-off-by: kangletian <Letian.Kang@amd.com>
Purpose
MoE models quantized by AMD Quark with W8A8 INT8 (per-channel weight + per-token dynamic activation) cannot be loaded in vLLM. For example, quantizing MiniMax-M2.1 (456B MoE) with Quark's `ptpc_int8` scheme produces a model that fails at startup:

- `quark.py`: `_get_scheme_from_config()` only recognizes static per-tensor W8A8 INT8 via `_is_static_tensor_w8a8`, missing the dynamic per-token + per-channel weight config → `RuntimeError("Unsupported quantization scheme")`
- `quark_moe.py`: no INT8 MoE method exists (only Fp8 and OCP_MX) → `RuntimeError("Unsupported FusedMoe scheme")`
- `fused_moe/utils.py`: `_int8_quantize()` hard-asserts `per_act_token` when `block_shape is None`, blocking the per-tensor static/dynamic INT8 paths

This PR adds:

- `_is_dynamic_per_token_w8a8()` detection in `quark.py`, routing to `QuarkW8A8Int8(is_static_input_scheme=False)`
- `QuarkW8A8Int8MoEMethod` in `quark_moe.py`, supporting both per-tensor and per-channel weight scales
- `_int8_quantize()` support for the per-token, static per-tensor, and dynamic per-tensor paths
- `trust_remote_code=False` → `True` in the `get_config()` call (required for custom models like MiniMax-M2.1)

Test Plan
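To illustrate the kind of check `_is_dynamic_per_token_w8a8()` performs, here is a minimal sketch. The dict keys and values below are illustrative assumptions, not Quark's actual serialized config schema:

```python
# Hedged sketch of detecting a dynamic per-token W8A8 INT8 scheme from a
# quantization config. Keys/values are assumptions for illustration only.
def is_dynamic_per_token_w8a8(weight_cfg: dict, input_cfg: dict) -> bool:
    # Weights: INT8, per-channel scales, quantized statically (offline).
    weight_ok = (
        weight_cfg.get("dtype") == "int8"
        and weight_cfg.get("qscheme") == "per_channel"
        and not weight_cfg.get("is_dynamic", False)
    )
    # Activations: INT8, per-token scales, computed dynamically at runtime.
    input_ok = (
        input_cfg.get("dtype") == "int8"
        and input_cfg.get("qscheme") == "per_token"
        and input_cfg.get("is_dynamic", False)
    )
    return weight_ok and input_ok

matched = is_dynamic_per_token_w8a8(
    {"dtype": "int8", "qscheme": "per_channel", "is_dynamic": False},
    {"dtype": "int8", "qscheme": "per_token", "is_dynamic": True},
)
print(matched)  # True
```

When a config matches, the PR routes it to `QuarkW8A8Int8(is_static_input_scheme=False)`; unmatched configs still fall through to the existing error path.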
Test Result
Tested on MiniMax-M2.1 (456B MoE) quantized with Quark `ptpc_int8`, served with vLLM on 8 GPUs:
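The ptpc ("per-token, per-channel") INT8 arithmetic exercised by this scheme can be illustrated numerically. This is a hedged NumPy sketch of the quantize → INT8 matmul → dequantize flow, not vLLM's kernel implementation:

```python
# Hedged sketch (not vLLM's implementation): per-channel weight +
# per-token dynamic activation INT8 GEMM, the scheme this PR enables.
import numpy as np

def quantize_per_channel(w):
    # One static scale per output channel (row of W): s = max|w| / 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def quantize_per_token(x):
    # One scale per token (row of X), computed dynamically at runtime.
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    qx, sx = quantize_per_token(x)
    qw, sw = quantize_per_channel(w)
    # Accumulate in int32, then dequantize with the outer product of scales.
    acc = qx.astype(np.int32) @ qw.T.astype(np.int32)
    return acc * (sx @ sw.T)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # 4 tokens
w = rng.standard_normal((32, 64)).astype(np.float32)  # 32 output channels
out = int8_matmul(x, w)
ref = x @ w.T
print(np.abs(out - ref).max())  # small quantization error vs. fp32
```

Because the activation scales are recomputed per token, no calibration of activation ranges is needed, at the cost of a runtime max-reduction per token.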
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.